Coresets for Discrete Integration and Clustering

Author

  • Sariel Har-Peled
Abstract

Given a set P of n points on the real line and a (potentially infinite) family of functions F, we investigate the problem of finding a small (weighted) subset S ⊆ P, such that for any f ∈ F, we have that f(P) is a (1 ± ε)-approximation to f(S). Here, f(Q) = ∑_{q∈Q} w(q)f(q) denotes the weighted discrete integral of f over the point set Q, where w(q) is the weight assigned to the point q. We study this problem, and provide tight bounds on the size of S for several families of functions. As an application, we present some coreset constructions for clustering.
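The definition in the abstract can be made concrete with a small sketch. The code below is not the paper's construction; it only evaluates the weighted discrete integral f(Q) and checks, by brute force over a finite list of sample functions, whether a candidate weighted subset S satisfies the (1 ± ε) condition. The names `discrete_integral` and `is_approximation` are hypothetical, and the check assumes the functions are nonnegative.

```python
from typing import Callable, Iterable, List, Tuple

# A weighted point set is a list of (point, weight) pairs on the real line.
WeightedSet = List[Tuple[float, float]]


def discrete_integral(f: Callable[[float], float], Q: WeightedSet) -> float:
    """Weighted discrete integral f(Q) = sum over q in Q of w(q) * f(q)."""
    return sum(w * f(q) for q, w in Q)


def is_approximation(
    functions: Iterable[Callable[[float], float]],
    P: WeightedSet,
    S: WeightedSet,
    eps: float,
) -> bool:
    """Check (1 - eps) * f(S) <= f(P) <= (1 + eps) * f(S) for every sample f.

    Assumption: each f is nonnegative, so the two-sided bound is meaningful.
    This tests only the finite sample given, not an infinite family F.
    """
    for f in functions:
        full = discrete_integral(f, P)
        approx = discrete_integral(f, S)
        if not ((1 - eps) * approx <= full <= (1 + eps) * approx):
            return False
    return True


# Illustrative data (made up for this sketch): four unit-weight points,
# summarized by two points of weight 2 each.
P = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
S = [(1.0, 2.0), (2.0, 2.0)]
sample_functions = [lambda x: 1.0, lambda x: x, lambda x: x * x]
```

With this data, `is_approximation(sample_functions, P, S, 0.5)` holds, while a tighter `eps` such as 0.1 fails on the quadratic function, which illustrates why the size of S has to grow as ε shrinks or as the family F becomes richer.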


Similar resources

Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...


On the Sensitivity of Shape Fitting Problems

In this article, we study shape fitting problems, ε-coresets, and total sensitivity. We focus on the (j, k)-projective clustering problems, including k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem. We derive upper bounds of total sensitivities for these problems, and obtain ε-coresets using these upper bounds. Using a dimension-...


Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...


Distributed Balanced Clustering via Mapping Coresets

Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster, which leads to the problem of clustering under capacity constraints, or the “balanced clustering” problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound di...


Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clus...




Publication date: 2006